Importing the data

df <- read.csv("C:/Users/Bodoque/Desktop/housing_data_CDMX_v2.csv")
head(df)
##   property_type        places               lat.lon   price currency
## 1     apartment MiguelHidalgo 23.634501,-102.552788 5500000      MXN
## 2         house    Iztapalapa   19.31033,-99.068557 1512000      MXN
## 3     apartment       Tlalpan  19.279771,-99.234597  926667      MXN
## 4     apartment MiguelHidalgo 23.634501,-102.552788 6410000      MXN
## 5     apartment MiguelHidalgo  19.432657,-99.177444 4416000      MXN
## 6     apartment  BenitoJuarez  19.367025,-99.170349 3150000      MXN
##   price_aprox_local_currency price_aprox_usd surface_total_in_m2
## 1                    5450246       289775.66                  54
## 2                    1498322        79661.96                  80
## 3                     918284        48822.82                 100
## 4                    6352013       337720.36                 135
## 5                    4376052       232663.51                  87
## 6                    3122244       166001.78                 100
##   surface_covered_in_m2 price_usd_per_m2 price_per_m2      lat        lon
## 1                    54        5366.2159    101851.85 23.63450 -102.55279
## 2                    80         995.7745     18900.00 19.31033  -99.06856
## 3                   100         488.2282      9266.67 19.27977  -99.23460
## 4                   135        2501.6323     47481.48 23.63450 -102.55279
## 5                    87        2674.2932     50758.62 19.43266  -99.17744
## 6                   100        1660.0178     31500.00 19.36703  -99.17035

I will delete the columns that are not useful for this analysis:

df <- df %>% 
  select(-c(lat.lon,currency,price,price_aprox_local_currency, price_per_m2))
summary(df) # Summary statistics
##  property_type         places            price_usd        surface_total_in_m2
##  Length:18234       Length:18234       Min.   :    5797   Min.   :       1   
##  Class :character   Class :character   1st Qu.:   59070   1st Qu.:      67   
##  Mode  :character   Mode  :character   Median :  120652   Median :      95   
##                                        Mean   :  270398   Mean   :    4062   
##                                        3rd Qu.:  250484   3rd Qu.:     195   
##                                        Max.   :17890000   Max.   :65748000   
##  surface_covered_in_m2 price_usd_per_m2         lat             lon         
##  Min.   :       1      Min.   :1.000e-03   Min.   :19.19   Min.   :-102.55  
##  1st Qu.:      67      1st Qu.:6.774e+02   1st Qu.:19.35   1st Qu.: -99.20  
##  Median :      95      Median :1.108e+03   Median :19.39   Median : -99.16  
##  Mean   :    4040      Mean   :1.504e+03   Mean   :19.46   Mean   : -99.22  
##  3rd Qu.:     200      3rd Qu.:1.898e+03   3rd Qu.:19.43   3rd Qu.: -99.14  
##  Max.   :65748000      Max.   :1.759e+05   Max.   :41.58   Max.   : -90.49
total_rows <- nrow(df)
total_cols <- ncol(df)
cat("Total rows: ",total_rows)
## Total rows:  18234
cat("\nTotal columns: ",total_cols)
## 
## Total columns:  8

Exploratory Data Analysis

The dataset contains 8 columns:

  1. property_type : Type of property (e.g., house, apartment)
  2. places : Borough of Mexico City where the property is located
  3. price_usd : Property price in dollars.
  4. surface_total_in_m2 : Total surface area in m2.
  5. surface_covered_in_m2 : Built surface in m2.
  6. price_usd_per_m2 : Property price per square meter in USD.
  7. lat : Latitude coordinate of the property location.
  8. lon : Longitude coordinate of the property location.

The histogram above shows the distribution of property prices in USD on a logarithmic scale. Most properties are concentrated between 50,000 and 300,000 USD, with a peak around 100,000–200,000 USD. A smaller number of high-value properties extend the distribution to over 1 million USD, which creates a long right tail. This indicates that the market is dominated by mid-priced properties, while luxury listings are rare but present.

The histogram above shows the distribution of total surface area (in square meters) using a logarithmic scale. Most properties cluster between 70 and 200 m2, with a strong peak around 100 m2, which reflects the typical size of apartments or small houses in Mexico City. A small number of records extend far beyond this range, representing outliers or possibly data entry errors with unrealistically large values.

The boxplot displays the distribution of total surface area (m2) on a logarithmic scale. Most properties are concentrated between 70 and 200 m2, with a median close to 100 m2. However, there are numerous outliers with much larger areas, indicating the presence of unusually large properties or data entry errors.

As we can see, all categories show extreme outliers, reflecting unusually large properties or potential data inconsistencies. Next, take a look at the distribution of covered surface area, it shows a pattern similar to what we observed earlier with the total surface area.

The histogram shows that most properties are concentrated between 1,000 and 5,000 USD per m2, with a clear peak around 2,000 USD per m2. The boxplot confirms this distribution, with the majority of values clustered in a narrow range but with numerous outliers on both ends. This indicates that while the market has a typical mid-range price, there are also extreme cases that deviate significantly from the norm.

Cleaning the Data

There are some values that do not make sense. For example, the column surface_total_in_m2 shows a minimum value of 1, and the column surface_covered_in_m2 also has 1. This is unrealistic because the area would be too small. I am going to explore the data further to determine whether these values need to be deleted.

minimal_area_m2 <- df %>%
  filter(surface_total_in_m2 < 70)
nrow(minimal_area_m2)
## [1] 4919

There are 4,919 records with a total surface area below 70 m2. These records will be removed to ensure the dataset contains only realistic values.

df <- df[df$surface_total_in_m2 >= 70, ] # Filter rows where surface_total_in_m2 >= 70. The empty part after the comma means keep all columns.
total_rows <- nrow(df)
cat("Total rows after filtering for surfaces greater than 70 m2: ",total_rows)
## Total rows after filtering for surfaces greater than 70 m2:  13315

We also observed outliers in the upper part of the box plot. This indicates that some properties have unusually large areas. Therefore, I am going to remove those records by applying the interquartile range (IQR) method.

cat("Rows before:", nrow(df), "\n")
## Rows before: 13315
cat("Rows after removing outliers:", nrow(df_no_outliers), "\n")
## Rows after removing outliers: 12187
cat("Removed:", nrow(df) - nrow(df_no_outliers), "rows\n")
## Removed: 1128 rows

Next, we will check for duplicates and null values.

df<-df_no_outliers 
df <- df %>% distinct()
cat("Duplicate rows deleted: ", total_rows - nrow(df), "\n")
## Duplicate rows deleted:  2911
cat("Null values: ",sum(is.na(df)))
## Null values:  0
total_rows <- nrow(df)
cat("Total rows: ",total_rows)
## Total rows:  10404

summary(df$surface_total_in_m2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      70      85     120     162     211     492

We need these results to match the covered area, so I will apply a filter that gathers the same information.

df <- df[df$surface_covered_in_m2 >= 70, ]
df <- df[df$surface_covered_in_m2 <= 492, ]
total_rows <- nrow(df)
cat("Total rows: ",total_rows)
## Total rows:  10277

summary(df)
##  property_type         places            price_usd        surface_total_in_m2
##  Length:9698        Length:9698        Min.   :    5797   Min.   : 70.0      
##  Class :character   Class :character   1st Qu.:   79567   1st Qu.: 85.0      
##  Mode  :character   Mode  :character   Median :  151210   Median :120.0      
##                                        Mean   :  261281   Mean   :158.9      
##                                        3rd Qu.:  274035   3rd Qu.:208.0      
##                                        Max.   :12314238   Max.   :492.0      
##  surface_covered_in_m2 price_usd_per_m2         lat             lon         
##  Min.   : 70.0         Min.   :    14.82   Min.   :19.19   Min.   :-102.55  
##  1st Qu.: 85.0         1st Qu.:   686.35   1st Qu.:19.35   1st Qu.: -99.20  
##  Median :118.0         Median :  1136.10   Median :19.39   Median : -99.17  
##  Mean   :156.6         Mean   :  1556.94   Mean   :19.45   Mean   : -99.22  
##  3rd Qu.:200.0         3rd Qu.:  2025.50   3rd Qu.:19.42   3rd Qu.: -99.14  
##  Max.   :492.0         Max.   :175917.69   Max.   :23.63   Max.   : -90.49

After removing the outliers in surface area, the price histogram looks more reasonable. The area outliers were probably distorting the price distribution, and once removed, the shape now clearly shows where most properties in Mexico City are concentrated. You can see the difference in the histogram below.

However, I can see a maximum value of 12,314,238 USD. That price could be real, but I doubt it, so I will review the record to see whether the surface area coincides or if it is an incorrectly recorded price.

high_price_df <- df %>%
  filter(price_usd > 12000000)

nrow(high_price_df)   
## [1] 1
head(high_price_df)  
##   property_type       places price_usd surface_total_in_m2
## 1     apartment BenitoJuarez  12314238                  70
##   surface_covered_in_m2 price_usd_per_m2      lat       lon
## 1                    70         175917.7 19.36045 -99.15689

It is an apartment with only 70 m2 of covered and total surface area, which is impossible, so I will drop it.

df <- df[df$price_usd <= 12000000, ]
df %>%
  arrange(price_usd) %>%
  head(10)
##    property_type        places price_usd surface_total_in_m2
## 1          store AlvaroObregon   5796.88                 251
## 2      apartment MiguelHidalgo   7000.00                 350
## 3      apartment    Cuajimalpa   7114.36                 480
## 4      apartment       Tlahuac   8693.26                  86
## 5          house    Xochimilco  15805.94                 230
## 6          house    Xochimilco  15809.69                 200
## 7      apartment AlvaroObregon  15893.77                  80
## 8          house      Coyoacan  17386.53                  70
## 9          house      Coyoacan  19005.85                  70
## 10         house      Coyoacan  19005.85                  80
##    surface_covered_in_m2 price_usd_per_m2      lat       lon
## 1                    251         23.09514 19.34652 -99.19129
## 2                    350         20.00000 19.42985 -99.18195
## 3                    480         14.82158 19.35735 -99.29979
## 4                     86        101.08442 19.28339 -99.05504
## 5                    230         68.72148 19.24157 -99.11103
## 6                    200         79.04845 19.24102 -99.10752
## 7                     80        198.67212 19.39081 -99.19541
## 8                     70        248.37900 19.31224 -99.10609
## 9                     70        271.51214 19.35021 -99.16215
## 10                    80        237.57312 19.32377 -99.11656

There are 4 values that are far from the others in terms of price. For the sake of simplifying the data, I will drop them.

Data Consistency Checks

df <- df %>%
  mutate(
    price_usd_per_m2 = price_usd / surface_total_in_m2,
    covered_ratio = surface_covered_in_m2 / surface_total_in_m2)

stopifnot(all(df$surface_total_in_m2 >= df$surface_covered_in_m2, na.rm = TRUE))
summary(df[, c("price_usd","price_usd_per_m2","surface_total_in_m2","surface_covered_in_m2","covered_ratio")])
##    price_usd       price_usd_per_m2   surface_total_in_m2 surface_covered_in_m2
##  Min.   :  15806   Min.   :   68.69   Min.   : 70.0       Min.   : 70.0        
##  1st Qu.:  79575   1st Qu.:  686.68   1st Qu.: 85.0       1st Qu.: 85.0        
##  Median : 151210   Median : 1136.25   Median :120.0       Median :118.0        
##  Mean   : 260142   Mean   : 1539.58   Mean   :158.9       Mean   :156.6        
##  3rd Qu.: 274035   3rd Qu.: 2025.50   3rd Qu.:208.0       3rd Qu.:200.0        
##  Max.   :8429837   Max.   :31876.47   Max.   :492.0       Max.   :492.0        
##  covered_ratio   
##  Min.   :0.2575  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.9918  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

In the picture above, we can see some points where the price is very high while the total surface area is low. This may be a data entry error or could be explained by the location. In any case, I will analyze these occurrences further.

summary(df[, c("surface_total_in_m2", "price_usd")])
##  surface_total_in_m2   price_usd      
##  Min.   : 70.0       Min.   :  15806  
##  1st Qu.: 85.0       1st Qu.:  79575  
##  Median :120.0       Median : 151210  
##  Mean   :158.9       Mean   : 260142  
##  3rd Qu.:208.0       3rd Qu.: 274035  
##  Max.   :492.0       Max.   :8429837
df_flags <- df %>%
  mutate(
    flag_small_highprice = surface_total_in_m2 < 85  & price_usd > 274035,
    flag_large_lowprice  = surface_total_in_m2 > 208  & price_usd < 79575,
    flag_any = flag_small_highprice | flag_large_lowprice
  )


cat("Small and very expensive:", sum(df_flags$flag_small_highprice), "\n")
## Small and very expensive: 27
cat("Large and very cheap:", sum(df_flags$flag_large_lowprice), "\n")
## Large and very cheap: 103
cat("Total outliers:", sum(df_flags$flag_any), "\n")
## Total outliers: 130
top_bottom <- df_flags %>%
  filter(flag_any) %>%
  arrange(desc(price_usd))

bind_rows(
  head(top_bottom, 5),
  tail(top_bottom, 5)
)
##    property_type        places price_usd surface_total_in_m2
## 1      apartment      Coyoacan 790297.25                  70
## 2          house      Coyoacan 726546.61                  70
## 3      apartment MiguelHidalgo 500000.00                  76
## 4      apartment    Cuauhtemoc 395945.99                  79
## 5          house  BenitoJuarez 385138.19                  80
## 6      apartment    Cuauhtemoc  27923.83                 370
## 7      apartment    Cuauhtemoc  27676.20                 380
## 8      apartment    Cuauhtemoc  26994.23                 393
## 9      apartment    Cuauhtemoc  21795.44                 281
## 10         house    Xochimilco  15805.94                 230
##    surface_covered_in_m2 price_usd_per_m2      lat       lon covered_ratio
## 1                     70      11289.96071 19.31608 -99.12644             1
## 2                     70      10379.23729 19.35381 -99.16361             1
## 3                     76       6578.94737 19.40172 -99.23482             1
## 4                     79       5011.97456 19.43072 -99.15837             1
## 5                     80       4814.22738 19.39551 -99.18424             1
## 6                    370         75.46981 19.45034 -99.12606             1
## 7                    380         72.83211 19.44784 -99.14040             1
## 8                    393         68.68761 19.44682 -99.13039             1
## 9                    281         77.56384 19.45178 -99.15709             1
## 10                   230         68.72148 19.24157 -99.11103             1
##    flag_small_highprice flag_large_lowprice flag_any
## 1                  TRUE               FALSE     TRUE
## 2                  TRUE               FALSE     TRUE
## 3                  TRUE               FALSE     TRUE
## 4                  TRUE               FALSE     TRUE
## 5                  TRUE               FALSE     TRUE
## 6                 FALSE                TRUE     TRUE
## 7                 FALSE                TRUE     TRUE
## 8                 FALSE                TRUE     TRUE
## 9                 FALSE                TRUE     TRUE
## 10                FALSE                TRUE     TRUE

There are 130 records that meet the established conditions. Before deciding to drop them, I will plot them by borough and property type.

The five boroughs with the most values are Cuauhtémoc, Benito Juárez, Gustavo A. Madero, Miguel Hidalgo, and Álvaro Obregón.

Apartments and houses are the ones with the most flagged outliers.

In the plot above, we see that stores and houses almost always have low prices, whereas apartments tend to have higher prices.

In the plot above, we see that stores are the property type with the smallest total surface area, followed by houses, and then apartments. Apartments range from approximately 70 m2 to the maximum recorded value of 492 m2.

df_flags %>%
  ggplot(aes(x = flag_any, y = price_usd_per_m2)) +
  geom_boxplot(fill = "#D0BBFC", outlier.alpha = 0.3) +
  scale_y_log10(labels = scales::comma) +
  labs(title = "Comparison of USD/m2 between normal and flagged records",
       x = "Flagged as outlier", y = "USD per m2 (log)") +
  theme_minimal()

The boxplot compares the distribution of price per m2 between normal records (FALSE) and those flagged as outliers (TRUE).

Normal records show a median around 1,000 USD/m2, with most values between 500 and 2,000 USD/m2, which is consistent with typical market ranges in Mexico City.

Flagged records, however, are shifted downward, with a median closer to 500 USD/m2. They also present extreme high values (>10,000 USD/m2), suggesting inconsistencies such as very cheap large properties or very expensive small ones.

This confirms that the flagged group behaves very differently from the main distribution, supporting the case for excluding them.

df_clean <- df_flags %>%
  filter(!flag_any) %>%
  select(-starts_with("flag_"))

cat("Rows before:", nrow(df_flags), "\n")
## Rows before: 9693
cat("Rows after removing outliers:", nrow(df_clean), "\n")
## Rows after removing outliers: 9563
cat("Removed:", nrow(df_flags) - nrow(df_clean), "rows\n")
## Removed: 130 rows
df <- df_clean

Importance of Removing Outliers

p1 <- ggplot(df_bf, aes(x = price_aprox_usd)) +
  geom_histogram(bins = 60, fill = "#D0BBFC", color = "white") +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Before Cleaning", x = "Price (USD, log10)", y = "Count") +
  theme_minimal()

p2 <- ggplot(df, aes(x = price_usd)) +
  geom_histogram(bins = 60, fill = "#9AD0EC", color = "white") +
  scale_x_log10(labels = scales::comma) +
  labs(title = "After Cleaning", x = "Price (USD, log10)", y = "Count") +
  theme_minimal()
p1 + p2

Before cleaning: the distribution is wider and strongly skewed to the right, with extreme values above 5 million USD. These outliers inflate the tail of the distribution and make the market appear more dispersed than it really is.

After cleaning: the distribution becomes more compact and symmetric. Most properties are concentrated between 50,000 and 500,000 USD, with fewer extreme values.

KPI’s by Borough (Places)

## # A tibble: 16 × 6
##    places                n median_price_usd mean_price_usd median_ppm2 mean_ppm2
##    <chr>             <int>            <dbl>          <dbl>       <dbl>     <dbl>
##  1 MiguelHidalgo      1311          526865.        664846.       2581.     2988.
##  2 Cuajimalpa          469          463641.        573380.       2222.     2293.
##  3 BenitoJuarez       2325          160852.        186200.       1581.     1575.
##  4 Cuauhtemoc          961          127005.        192282.       1133.     1541.
##  5 AlvaroObregon      1190          160047.        225132.       1100.     1366.
##  6 Tlalpan             592          152809.        210642.        967.     1132.
##  7 MagdalenaContrer…   191          177821.        230125.        905.     1130.
##  8 Coyoacan            767          115910.        154848.        887.     1056.
##  9 Xochimilco          150          141592.        151826.        770.      816.
## 10 Azcapotzalco        303           72391.         94553.        704.      736.
## 11 VenustianoCarran…   179           71864.         97306.        681.      773.
## 12 Iztacalco           178           76149.         98262.        657.      710.
## 13 GustavoAMadero      522          102751.        122195.        656.      794.
## 14 Tlahuac              59           86383.        102737.        652.      658.
## 15 Iztapalapa          365           83791.         98413.        618.      697.
## 16 MilpaAlta             1           39524.         39524.        395.      395.

## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

There are some points that are incorrectly located, so I will drop them.

# Apply the filter before dropping anything to check that the record does not match the locations.
df_filtered <- df %>% filter(surface_total_in_m2 == 110 & places == "AlvaroObregon" & price_usd_per_m2 >= 1504 & price_usd_per_m2 <= 1505)
df_filtered %>% head(19) 
##   property_type        places price_usd surface_total_in_m2
## 1         house AlvaroObregon  165474.8                 110
##   surface_covered_in_m2 price_usd_per_m2      lat       lon covered_ratio
## 1                   110         1504.316 19.53541 -99.24323             1
nrow(df_filtered)
## [1] 1
df <- df %>%
  filter(!(surface_total_in_m2 == 195 & places == "GustavoAMadero" & price_usd == 151737.07))

df <- df %>%
  filter(!(surface_total_in_m2 == 300 & places == "AlvaroObregon" & price_usd >= 165435 & price_usd <= 165436))

df <- df %>%
  filter(!(surface_total_in_m2 == 110 & places == "AlvaroObregon" & price_usd_per_m2 >= 1504 & price_usd_per_m2 <= 1505))

df <- df %>%
  filter(!(surface_total_in_m2 == 364 & places == "Cuajimalpa" & price_usd_per_m2 >= 694))

df <- df %>%
  filter(!(surface_total_in_m2 == 284 & places == "Cuajimalpa" & price_usd_per_m2 >= 408))

df <- df %>%
  mutate(
    places = ifelse(surface_total_in_m2 == 125 & places == "Cuauhtemoc" & price_usd_per_m2 >= 1970 & price_usd_per_m2 <=1972,
                    "Xochimilco", places))
library(broom)

model_df <- df %>%
  mutate(log_price = log(price_usd),
         log_total = log(surface_total_in_m2)) %>%
  filter(is.finite(log_price), is.finite(log_total)) %>%
  select(log_price, log_total, property_type, places, covered_ratio)

m1 <- lm(log_price ~ log_total + covered_ratio + property_type + places, data = model_df)

glance(m1)   # R2, adj.R2, etc.
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.623         0.622 0.573      788.       0    20 -8226. 16497. 16654.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(m1) %>% arrange(p.value) %>% head(12)
## # A tibble: 12 × 5
##    term                     estimate std.error statistic   p.value
##    <chr>                       <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                 5.65     0.141      40.1  0        
##  2 log_total                   1.23     0.0136     90.6  0        
##  3 property_typehouse         -0.409    0.0166    -24.6  6.25e-130
##  4 placesMiguelHidalgo         0.547    0.0240     22.8  5.43e-112
##  5 placesIztapalapa           -0.459    0.0346    -13.3  6.78e- 40
##  6 placesGustavoAMadero       -0.389    0.0304    -12.8  3.09e- 37
##  7 placesCuajimalpa            0.399    0.0324     12.3  1.53e- 34
##  8 placesAzcapotzalco         -0.453    0.0370    -12.2  3.19e- 34
##  9 placesIztacalco            -0.456    0.0461     -9.89 6.22e- 23
## 10 placesVenustianoCarranza   -0.422    0.0460     -9.16 6.37e- 20
## 11 placesBenitoJuarez          0.166    0.0208      7.98 1.69e- 15
## 12 placesTlahuac              -0.536    0.0765     -7.01 2.60e- 12